Failed Runs

Dealing with Failures

In a large dataset, it’s not hard to imagine that some runners will fail due to unforeseen circumstances. Failures can occur at any point: in the shell, the scheduler, or Python, for example. Runners will still be marked as “satisfied” if that is the case, but a summary of the error message will be available in ds.errors.

However, it is not enough to know that a calculation has failed, so let’s explore some tools that help you figure out why.

We’ll cover some common failure modes, so you can get a sense of what they look like:

  • Function Error

  • Argument Error

  • Submission Error

  • Walltime Issue

Function Error

There is always a disconnect (however small) between writing a function and running it remotely. This can lead to small issues that ultimately cause the job to fail.

For this example, we’ll simulate a broken function by attempting to access a variable that doesn’t exist:

[2]:
import time
from remotemanager import Dataset

def multiply(a, b):

    foo

    return a * b

ds = Dataset(multiply, skip=False)

ds.append_run({'a': 2, 'b': 2})

ds.run()
ds.wait(1, 10)

ds.fetch_results()
appended run runner-0
Staging Dataset... Staged 1/1 Runners
Transferring for 1/1 Runners
Transferring 5 Files... Done
Remotely executing 1/1 Runners
Fetching results
Transferring 1 File... Done

Run complete, let’s see what happened:

[3]:
ds.results
Warning! Found 1 error(s), also check the `errors` property!
[3]:
[RunnerFailedError('NameError: name 'foo' is not defined')]

No results, and a warning saying that there is something in the errors property; let’s check it.

[4]:
ds.errors
[4]:
["NameError: name 'foo' is not defined"]

Here’s the error we were expecting.

Key indicators of failure are:

  • An unexpected None result

  • Content in the errors property

Note

It is possible to have a populated error file but a successful run (some schedulers put warnings in stderr). This is why this message is only a warning. We will see an example of this later in this tutorial.
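
As a quick sanity check, you can walk the results and errors together to see which runners genuinely failed and which merely wrote something to stderr. This is a minimal sketch using only the properties shown above:

# pair each result with its error summary (if any); an error alongside a
# valid result may just be a warning (see the note above), while an error
# alongside a missing or failed result points to a genuine failure
for index, (result, error) in enumerate(zip(ds.results, ds.errors)):
    if error is not None:
        print(f"runner-{index} wrote to stderr: {error}")
    print(f"runner-{index} result: {result}")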

Function Fixes

Since the identity of the dataset is tied heavily to the function, the only option for fixing the function is to create a new dataset.

If you have already submitted runs that you don’t want to resubmit, however, you can copy them across to your new dataset, preserving their status. This is best done with ds_new.copy_runners(ds).

Let’s fix this function and rerun:

[5]:
def multiply(a, b):
    return a * b

ds_fixed = Dataset(multiply, skip=False)

ds_fixed.copy_runners(ds)

ds_fixed.runners
[5]:
[dataset-6fd64b82-runner-0]

Now we have our runner in our new dataset. This works because while the Dataset handles the function, a Runner only cares about the arguments. So as long as the function signatures match, copying across allows you to preserve your work.

Note

You can also select runners to insert, using ds.insert_runner(runner). Internally copy_runners uses this function by looping over the runners property of the given dataset.
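
As a rough sketch of that selective approach (ds_partial is a hypothetical name, and the filter here is illustrative):

# create a fresh dataset with the fixed function, then insert only the
# runners we want to keep from the old dataset
ds_partial = Dataset(multiply, skip=False)

for runner in ds.runners:
    # apply whatever filter you need; here every runner is kept
    ds_partial.insert_runner(runner)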

Note that since the runners are copied across unchanged, they retain their run state. So if we want to rerun, we must force:

[6]:
ds_fixed.run()
Staging Dataset... No Runners staged
No Transfer required
[6]:
False
[7]:
ds_fixed.run(force=True)

ds_fixed.wait(1, 10)

ds_fixed.fetch_results()
Staging Dataset... Staged 1/1 Runners
Transferring for 1/1 Runners
Transferring 5 Files... Done
Remotely executing 1/1 Runners
Fetching results
Transferring 2 Files... Done
[8]:
ds_fixed.results
[8]:
[4]

Argument Error

When generating runs, sometimes the arguments themselves can be at fault. We can demonstrate this simply by adding a runner for the multiply function that has None as one of the args.

[10]:
ds = Dataset(multiply, skip=False)

ds.append_run({"a": 10, "b": 5})
ds.append_run({"a": 7, "b": None})

ds.run()
ds.wait(1, 10)

ds.fetch_results()

ds.results
appended run runner-0
appended run runner-1
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 3 Files... Done
Warning! Found 1 error(s), also check the `errors` property!
[10]:
[50,
 RunnerFailedError('TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'')]
[11]:
ds.errors
[11]:
[None, "TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'"]

As expected, the 2nd runner failed.

Argument Fixes

Since the args themselves are at fault, and runners are responsible for holding the args, we should remove and replace this runner.

For this purpose, Dataset has a remove_run function:

[12]:
ds.remove_run({"a": 7, "b": None})
removed runner dataset-6fd64b82-runner-1
[12]:
True

For more information on removing runners (and other bad data), see the Dataset Cleaning Tutorial.

Submission Error - python

Supercomputers are often very specific about their environments and software. It’s very easy to specify an incorrect module, python version, or submitter. This is usually solved within the URL, though the issue can also arise from the extra in the dataset or runner. In any case, simply updating the incorrect line and resubmitting is often enough to resolve the issue.

Let’s set python to something that doesn’t exist to simulate this:

[14]:
from remotemanager import URL

url = URL(python="foo")
ds = Dataset(multiply, url=url, skip=False)

ds.append_run({"a": 10, "b": 5}, extra="bar")
ds.append_run({"a": 7, "b": 15})

ds.run()
ds.wait(1, 10)

ds.fetch_results()

ds.results
appended run runner-0
appended run runner-1
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 2 Files... Done
Warning! Found 2 error(s), also check the `errors` property!
[14]:
[RunnerFailedError('dataset-6fd64b82-runner-0-jobscript.sh: line 6: foo: command not found'),
 RunnerFailedError('dataset-6fd64b82-runner-1-jobscript.sh: line 5: foo: command not found')]
[15]:
ds.errors
[15]:
['dataset-6fd64b82-runner-0-jobscript.sh: line 6: foo: command not found',
 'dataset-6fd64b82-runner-1-jobscript.sh: line 5: foo: command not found']

Error Investigation

Here we know that python was set to an incorrect value, but that is not always so obvious, so the error may need more investigation.

First off, the errors property only shows us the last line of the error. While this can be enough, let’s see if there’s more to this particular error.

The ds.failed property returns a list of all runners that report is_failed=True. Runners also have a full_error property which will return the full contents of the error file for you:

[16]:
print(ds.failed[0].full_error)
dataset-6fd64b82-runner-0-jobscript.sh: line 3: bar: command not found
dataset-6fd64b82-runner-0-jobscript.sh: line 6: foo: command not found

Here we can see the error from the “bar” string that we set in the runner extra, but no further information about our “foo” error.

Let’s fix that by setting python to something sensible. Let’s also leave the bar untouched for now, to see what happens:

[18]:
ds.url.python = "python3"

ds.run(force=True)
ds.wait(1, 10)

ds.fetch_results()

ds.results
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 4 Files... Done
Warning! Found 1 error(s), also check the `errors` property!
[18]:
[50, 105]

Since we didn’t update the extra="bar" line, we still have an error there! This demonstrates that just because there is an error does not necessarily mean that the run has failed.

[19]:
ds.errors
[19]:
['dataset-6fd64b82-runner-0-jobscript.sh: line 3: bar: command not found',
 None]

Extra Fixes

However, this is something that can be fixed; simply setting the extra back to None will remove this error:

[21]:
ds.get_runner(0).extra = None

ds.run(force=True, force_ignores_success=True)
ds.wait(1, 10)

ds.fetch_results()

ds.results
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 4 Files... Done
[21]:
[50, 105]

Submission Errors - shell

Submission of a run requires more than a simple python command; there are in fact two more similar arguments: submitter, which is put into the master script, and shell, which is used to launch the master script.
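
As a rough illustration, they can be set something like the following. The values “bash” and “sbatch” are illustrative, and setting submitter as a plain attribute here is an assumption; adapt this to your own machine:

from remotemanager import URL

# shell is used to launch the master script itself
url = URL(shell="bash")
# submitter is placed into the master script to launch each jobscript
# (assumed attribute access; "sbatch" would suit a SLURM machine)
url.submitter = "sbatch"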

If you suspect that your shell might be broken, there is a very simple way to see what was submitted:

[23]:
url = URL(shell="foo")
ds = Dataset(multiply, url=url, skip=False)

ds.append_run({"a": 10, "b": 5}, extra="bar")
ds.append_run({"a": 7, "b": 15})

ds.run()

ds.wait(1, 5)

ds.fetch_results()

ds.results
appended run runner-0
appended run runner-1
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[23], line 9
      5 ds.append_run({"a": 7, "b": 15})
      7 ds.run()
----> 9 ds.wait(1, 5)
     11 ds.fetch_results()
     13 ds.results

File ~/remotemanager/remotemanager/dataset/dataset.py:2388, in Dataset.wait(self, interval, timeout, watch, success_only, only_runner, force)
   2386 t0 = int(time.time())
   2387 # check all non None states
-> 2388 while not wait_condition():
   2389     dt = int(time.time()) - t0
   2391     if watch:

File ~/remotemanager/remotemanager/dataset/dataset.py:2361, in Dataset.wait.<locals>.wait_condition()
   2360 def wait_condition():
-> 2361     states = self._is_finished(force=force)
   2363     if only_runner is not None:
   2364         return only_runner.is_finished

File ~/remotemanager/remotemanager/dataset/dataset.py:2305, in Dataset._is_finished(self, check_dependency, dependency_call, force)
   2303         warnings.warn(msg)
   2304     else:
-> 2305         raise RuntimeError(msg)
   2307 if check_dependency and not dependency_call and self.dependency is not None:
   2308     self.dependency.check_failure()

RuntimeError: Dataset encountered an issue:
/bin/bash: foo: command not found

Our wait ended with an error and no output files were produced, a surefire indicator of a problem. If not even an error file was produced, it’s very likely that the calculations were never submitted, something that is usually caused by a broken launch command. We can check this with the run_cmd attribute:

[24]:
ds.run_cmd.sent
[24]:
'cd temp_runner_remote && foo dataset-6fd64b82-master.sh'

There’s a lot going on with this command, but all you really need to see here is the final section, where we can see our foo. This can be changed back to bash (or your preferred shell) via url.shell.
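
For example, something along these lines should get the runs going again (a minimal sketch reusing the force / wait / fetch pattern from earlier):

# point the shell back at bash and resubmit; force=True because the
# runners already hold a (failed) run state
ds.url.shell = "bash"

ds.run(force=True)
ds.wait(1, 10)

ds.fetch_results()
ds.results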

Walltime Errors

Even if you make no mistakes on your end, it’s still possible for a run to time out, run out of memory, or hit any other scheduler related issue. The fix in this case is similar to the previous example: bump up the walltime request (or other resources) if needed and resubmit.

To demonstrate this, we’ll insert a string into the jobscript that simulates a walltime issue, and also “hide” some “scheduler info” above it.

[26]:
fake_walltime = '''
echo "{scheduler info}" >&2
echo out of walltime! >&2
exit 1'''

ds = Dataset(multiply, skip=False)

ds.append_run({"a": 10, "b": 5}, extra=fake_walltime)
ds.append_run({"a": 7, "b": 15})

ds.run()
ds.wait(1, 10)

ds.fetch_results()

ds.results
appended run runner-0
appended run runner-1
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 3 Files... Done
Warning! Found 1 error(s), also check the `errors` property!
[26]:
[RunnerFailedError('out of walltime!'), 105]
[27]:
ds.errors
[27]:
['out of walltime!', None]

There’s our walltime line; perhaps the scheduler had more info for us?

[28]:
print(ds.failed[0].full_error)
{scheduler info}
out of walltime!

Seems it did; in a real case, this content could give some useful advice for fixing your jobs (resource limits, etc.).

Let’s remove the walltime issue and resubmit. Here, we’re just removing the extra, but in your case the fix may be on the URL side of things.

Since only one job actually failed, we really only want to rerun that one. You can use the ds.failed property to do this for you:

[29]:
for runner in ds.failed:
    runner.extra = None
    runner.run(force=True)
Staging Dataset... Staged 1/2 Runners
Transferring for 1/2 Runners
Transferring 5 Files... Done
Remotely executing 1/2 Runners
[30]:
ds.wait(1, 10)

ds.fetch_results()

ds.results
Fetching results
Transferring 2 Files... Done
[30]:
[50, 105]

Command Errors

In the background, Dataset is using the provided URL to issue commands on the remote machine. Sometimes, these can be the source of the failure.

The URL Tutorial has a section on error handling, but let’s cover how to access these tools from the Dataset.

Each Dataset has a url property, even if one was not set (a URL pointing at localhost will be created for you). This can be accessed at any time to change things or check for issues.

[31]:
ds.url.host
[31]:
'localhost'

Arguably the most useful debugging tool is the cmd_history property. This allows you to check the commands sent, up to the cmd_history_depth (defaults to 10).

We can write some quick debugging code to go through the history and find a specific command.

Let’s say we think there was a problem with rsync; all we need to do is iterate back through the history and see what’s there:

[32]:
transfer = None
for cmd in reversed(ds.url.cmd_history):
    if "rsync" in cmd.sent:
        transfer = cmd
        break

print(transfer.sent)
rsync -auvh --checksum temp_runner_remote/{dataset-6fd64b82-runner-0-error.out,dataset-6fd64b82-runner-0-result.json} /home/test/remotemanager/docs/source/tutorials/temp_runner_local/

This command was used to retrieve the error and result files from the remote; you can also see what was returned by the command execution:

[34]:
print(transfer.stdout)
sending incremental file list
dataset-6fd64b82-runner-0-error.out

sent 187 bytes  received 38 bytes  450.00 bytes/sec
total size is 2  speedup is 0.01

And any stderr:

[35]:
print(transfer.stderr)

In this case we have none, but if the rsync had thrown errors, they would be printed in full here.

Combined Debugging

In many cases, your problem will require a mix of these tools and solutions. But with experience, hopefully you will find the data flow easy to follow. Some points to remember (a small helper combining these tools is sketched after the list):

  • An error in the output does not necessarily mean a failed run, it could just be a warning.

  • Use the failed property in combination with the other runner-based tools to save having to search out the runners yourself.

  • Runner.full_error is invaluable in finding hidden parts of your errors.

  • Sometimes the url is at fault, check your cmd_history!

  • Failing that, ds.run_cmd will let you see whether your run was ever launched in the first place.
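
As a final illustration, here is a rough helper combining these tools. It is only a sketch, not part of the remotemanager API, and names like debug_dataset and target_ds are placeholders:

def debug_dataset(target_ds, history_depth=5):
    # the command used to launch the run; useful if nothing was ever submitted
    print("run command:", target_ds.run_cmd.sent)

    # full error file contents for every runner that reports is_failed=True
    for runner in target_ds.failed:
        print(f"--- {runner} ---")
        print(runner.full_error)

    # the most recent commands issued via the URL, newest first
    for cmd in list(reversed(target_ds.url.cmd_history))[:history_depth]:
        print(cmd.sent)
        if cmd.stderr:
            print("stderr:", cmd.stderr)

debug_dataset(ds)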